Out of all the advantages of attending UC Santa Barbara, one of the
most often cited is the lovely weather year-round in Goleta and SB. This
stable, temperate climate is created by a variety of factors, not least
of all the coastal location and the tall Santa Ynez mountains in the
north acting as a sort of shield against weather conditions further
inland. Evaporated moisture from the sea is trapped by the mountains,
creating a humid yet pleasant climate for residents of this small city.
Frequently, the humidity and other weather conditions result in a haze
blanketing the entire Goleta/SB area, while other days the marine layer
takes the place of this haze covering the sky.
When I began searching for a topic for this project, I knew that I
wanted to apply my data science skills to a real-world problem that was
personal to me, and weather prediction was what I settled on. I searched
for weather datasets all over the internet, but unfortunately the NOAA
did not have complete data records, and hundreds if not thousands of
observations were missing from every dataset I downloaded. However, I
finally found a good dataset to work with.
After several weeks of searching, I discovered that Weather
Underground had complete historical records of daily temperature,
humidity, wind speed, dew point, and air pressure for the past several
years. The data was recorded at Santa Barbara Municipal Airport. Since
WU had no CSV or XLSX files I could download, I manually copy-pasted the
data tables into Microsoft Excel for every month from May 2015 to May
2022 and converted the file to a CSV. I cleaned the data by hand as well
— there were 2557 observations in total, of which 27 were completely
missing. I assumed that these were days when the weather station was out
of service. In the end, I now have 2530 observations with which to
predict humidity. With that, let’s take a look at the dataset itself.
Before anything else, we must load all of our packages. You can see the full list of packages I am using below.
library(tidymodels)
library(ISLR)
library(ISLR2)
library(discrim)
library(poissonreg)
library(corrr)
library(klaR)
library(dplyr)
library(ggplot2)
library(glmnet)
library(parsnip)
library(janitor)
library(corrplot)
library(randomForest)
library(rpart)
library(rpart.plot)
library(regclass)
library(broomstick)
Now that that’s out of the way, it’s time to read in my data. I used
clean_names() from the janitor package to make
my data easier to wrangle. I will also make sure R can read my date
values as dates rather than characters, and add a
year_month variable in order to calculate monthly averages.
This makes some of my data, such as precipitation, temperature, and
humidity, far easier to plot.
weather <- read.csv("weatherdatanew.csv") %>% clean_names()
# We clean the variable names to make the code easier to write
weather$date <- as.Date(weather$date, format="%m/%d/%y")
# Cleaning date values
weather <- weather %>% mutate(year_month=format(date, "%y/%m"), .before=date)
# Adding year_month variable in order to create monthly averages for weather values
Now it’s time to start my EDA. Below, I created a correlation matrix of all of my variables in order to highlight important relationships between the predictors and response.
weather %>%
dplyr::select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'lower', diag = FALSE,
method = 'color')
# Correlation matrix of all numeric variables in order to ascertain which correlations stand out
The correlation matrix has some interesting implications -
while some of the relationships were anticipated, such as the
correlation between average wind speed and maximum wind speed, some
others were less obvious. In particular, the moderate negative
correlation between average humidity and maximum wind speed and the
strong positive correlation between average dew point and humidity stood
out to me. Thinking about it, these relationships make sense. High wind
speed means that water molecules in the air are spread further apart
from each other, so it’s harder for them to condense - thus, dew point
is lower. Average dew point is also related to humidity in that dew
point is the air temperature at which relative humidity would be 100% -
water then begins to condense into dew. If humidity is high, it makes
sense that the dew point would not have to be as low for relative
humidity to reach 100% and for water to start condensing.
I’d like to find out if there are any patterns in my data, so I’ve
plotted several line graphs below, each exploring one of the variables.
humidity_avg <- aggregate(average_humidity ~ year_month, weather, mean)
# Aggregate means of monthly humidity in a new data frame
h <- ggplot(data=humidity_avg, aes(x=year_month, y=average_humidity, group=1)) +
geom_line() + labs(y="Average Humidity", x="Year/Month")
# Store plot for monthly average humidity
h + theme(axis.text.x = element_text(angle = 90))
# Show plot while rotating x-axis values 90 degrees to prevent crowding
While this graph is somewhat rough, it still makes sense -
humidity tends to peak most in the warmest months, as increased sunlight
and heat causes more water to evaporate from the ocean and flora/fauna
on land.
temp_avg <- aggregate(average_temperature_o_f ~ year_month, weather, mean)
# Aggregate means of monthly temp in a new data frame
dew_avg <- aggregate(average_dew_point_o_f ~ year_month, weather, mean)
# Aggregate means of monthly dew point in a new data frame
total1 <- merge(humidity_avg, temp_avg, by="year_month")
total <- merge(total1, dew_avg, by="year_month")
ggplot(data=total, aes(x=year_month)) +
geom_line(aes(y = average_temperature_o_f, group=1, color="Temperature")) +
geom_line(aes(y = average_dew_point_o_f, group=1, color="Dew Point")) +
theme(axis.text.x = element_text(angle = 90)) + labs(y="Degrees Fahrenheit", x="Year/Month")
As expected, monthly average temperature and dew point closely
follow each other. As temperatures increase, the temperature at which
water condenses from the air grows higher, as there is more water vapor
in the air. This correlates well with the humidity graph as well, which
has peaks in warmer months.
precip_month <- aggregate(precipitation_in ~ year_month, weather, sum)
ggplot(data=precip_month, aes(x=year_month)) + geom_line(aes(y=precipitation_in, group=1)) +
theme(axis.text.x = element_text(angle = 90)) + labs(y="Total Precipitation", x="Year/Month")
This graph is hardly surprising. Precipitation peaks regularly between November and April every year, with the height of the peaks varying depending on whether a drought is currently in effect as well as other environmental factors.
ggplot(data=weather, aes(x=date)) + geom_point(aes(y=average_pressure_in_hg, group=1)) +
theme(axis.text.x = element_text(angle = 90)) + labs(y="Average Pressure (inHg)", x="Date")
Since Santa Barbara Airport is very close to sea level,
pressure stays very stable in the area. Still, some peaks and troughs
are visible in this scatterplot, though they are shallow. Notably, four
points stand out as outliers below the graph - I imagine these are
recording errors, or days of exceptionally low pressure prior to a
storm. There are a few other points that stick out slightly below and
above the main line of the scatterplot, but they seem to be legitimate
high- or low-pressure days rather than recording errors.
ggplot(data=weather, aes(x=date)) + geom_line(aes(y=average_wind_speed_mph, group=1, color="Average Wind Speed")) +
geom_line(aes(y=maximum_wind_speed_mph, group=1, color="Maximum Wind Speed")) +
theme(axis.text.x = element_text(angle=90)) + labs(y="Miles per Hour", x="Date")
Interestingly, average and maximum wind speed both dip every year around December/January. I’m not sure exactly why this is, but I believe it has to do with the fact that air pressure tends to stay low around that time of year, so there is not much movement of air.
Now it’s time to model the data. I’m choosing four models: a
regression decision tree, a random forest with bagging, a boosted trees
model, and a linear regression model.
Since my dataset is relatively small (2530 observations), I’m
choosing a 75-25 training-testing split and only creating 5
cross-validation folds with 5 repeats. I’m stratifying on the outcome
variable as well.
set.seed(3945)
weather_split <- initial_split(weather, prop=0.75, strata=average_humidity)
weather_train <- training(weather_split)
weather_test <- testing(weather_split)
weather_fold <- vfold_cv(weather_train, v=5, strata=average_humidity, repeats=5)
Date variables are tiresome to integrate into our models, especially
considering they do not fit neatly into numeric values, so we will leave
them out when creating our recipe.
weather_recipe <- recipe(average_humidity ~ average_temperature_o_f + average_dew_point_o_f + maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in, data=weather_train)
First we fit a regression decision tree. We are tuning
cost_complexity and tree_depth, then choosing the best model out of a
tuning grid with 5 levels. After selecting the best decision tree, we
will plot it so we can visualize it.
set.seed(3945)
regtreespec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
regtreewf <- workflow() %>%
add_model(regtreespec %>% set_args(cost_complexity = tune(), tree_depth = tune())) %>%
add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)
param_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 5)
tune_res <- tune_grid(
regtreewf,
resamples = weather_fold,
grid = param_grid
)
bestcomplexity <- select_best(tune_res, metric = "rmse")
regtreefinal <- finalize_workflow(regtreewf, bestcomplexity)
regtreefinalfit <- fit(regtreefinal, data = weather_train)
getcp(extract_fit_engine(regtreefinalfit))
regtreefinalpruned <- prune(extract_fit_engine(regtreefinalfit), cp=2.225653e-05)
save(regtreespec, regtreewf, param_grid, tune_res, bestcomplexity, regtreefinal, regtreefinalfit, regtreefinalpruned, file="regtree_models.RData")
I used the getcp() function to find the tree that
would result in the lowest level of cross-validation error, that being
this tree. Now we can look at our decision tree: it’s extremely large,
with dozens of nodes.
This
looks like it might be a bit overfitted, but we’ll see how it turns out
when we try to predict our testing data. A detailed image of the whole
tree is available at the link in DecisionTree.txt. Note that the file is
about 1.09 GB and so will take some time to load - you will need to zoom
in to view the whole tree. One main thing to note is that the tree model
only chose two predictors: average_temperature_o_f
and average_dew_point_o_f. Evidently, these two predictors
are the most relevant to humidity, with other factors not predicting it
as accurately.
Let’s look at how well this tree predicts the testing set.
weather_test %>%
mutate(
.pred=predict(regtreefinalpruned, weather_test)
) %>%
rmse(truth=average_humidity, estimate=.pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.56
Our tree is actually surprisingly accurate! Using the cp value
provided by getcp(), we find that the root mean square
error is 2.56%. Considering that the humidity ranges from 27.4% to
99.5%, this is actually a pretty good margin of error. Our decision tree
is fairly accurate.
We can also see a scatterplot of the actual data points compared to
the predicted values:
weather_test %>%
mutate(
.pred=predict(regtreefinalpruned, weather_test)
) %>%
ggplot(aes(average_humidity, .pred)) +
geom_abline() +
geom_point(alpha = 0.5)
It looks like the points follow the line pretty well!
Interestingly, the points start to get closer to the line as the
percentages grow. The error grows larger the smaller the percentage.
This is likely because Santa Barbara is located on the coastline, so
humidity is quite stable in the 70-90% range. Because of this stability,
most of the data points are located in that range as well, giving the
model more data with which to predict humidity. Lower humidity levels
are comparatively rare, so a lot more “noise” appears in the model.
Next we will fit a Random Forest model with bagging. We set
mtry to 6 since there are 6 predictors.
set.seed(3945)
bagging_spec <- rand_forest(mtry = 6) %>%
set_engine("randomForest") %>%
set_mode("regression")
baggingwf <- workflow() %>%
add_model(bagging_spec %>% set_args(min_n = tune())) %>%
add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)
param_grid_bag <- grid_regular(min_n(), levels = 5)
tune_res_bag <- tune_grid(
baggingwf,
resamples = weather_fold,
grid = param_grid_bag
)
bestcomplexitybag <- select_best(tune_res_bag, metric = "rmse")
bagfinal <- finalize_workflow(baggingwf, bestcomplexitybag)
bagfinalfit <- fit(bagfinal, data = weather_train)
Now let’s see how this model performs on the testing set.
augment(bagfinalfit, new_data = weather_test) %>%
rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.45
The RMSE for this model is significantly lower than for the
Decision Tree model - we have an RMSE of 1.45% as compared to 2.56%, a
full 1.11% lower.
Now we fit a Boosted Tree model. We set trees to 5000
and mtry to 6 like the previous model, then tune
tree_depth and min_n, using the
xgboost engine.
set.seed(3945)
boost_spec <- boost_tree(trees = 5000, learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
boostwf <- workflow() %>%
add_model(boost_spec) %>%
add_formula(average_humidity ~ maximum_wind_speed_mph + average_wind_speed_mph + average_pressure_in_hg + precipitation_in + average_temperature_o_f + average_dew_point_o_f)
param_grid_boost <- grid_regular(learn_rate(range=c(-10, -1)), levels = 5)
tune_res_boost <- tune_grid(
boostwf,
resamples = weather_fold,
grid = param_grid_boost
)
bestcomplexityboost <- select_best(tune_res_boost, metric = "rmse")
boostfinal <- finalize_workflow(boostwf, bestcomplexityboost)
boostfinalfit <- fit(boostfinal, data = weather_train)
Let’s see how the Boosted Tree model performs on the testing data.
augment(boostfinalfit, new_data = weather_test) %>%
rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.39
Our RMSE for this model is 1.39%, a bit lower than it was for
the Random Forest model! This is 1.17% less than our Decision Tree
RMSE.
Finally, we fit a Linear Regression model. This is straightforward
compared to the other models.
set.seed(3945)
lm_model <- linear_reg() %>%
set_engine("lm")
lm_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(weather_recipe)
lm_fit <- fit(lm_wflow, weather_train)
Now let’s see the performance of the Linear Regression model on the
testing set.
augment(lm_fit, new_data = weather_test) %>%
rmse(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.26
The RMSE for this model is 2.26% - though this is certainly on the
higher side, it’s still 0.3% lower than the Decision Tree model.
Now that we’ve calculated the RMSE for each model, we can see that
the Boosted Trees model is the most accurate for predicting the testing
data given the training set, with an RMSE of only 1.39%. The Random
Forest with Bagging is the second most accurate at 1.45%, Linear
Regression is third at 2.26%, and our Decision Tree is last at 2.56%.
However, this is just for the RMSE metric. Let’s see how accurate each
model is when the yardstick is R-squared and MAE (Mean Absolute Error)
as well.
weather_test %>%
mutate(
.pred=predict(regtreefinalpruned, weather_test)
) %>%
rsq(truth=average_humidity, estimate=.pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.967
augment(bagfinalfit, new_data = weather_test) %>%
rsq(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.990
augment(boostfinalfit, new_data = weather_test) %>%
rsq(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.990
augment(lm_fit, new_data = weather_test) %>%
rsq(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.974
weather_test %>%
mutate(
.pred=predict(regtreefinalpruned, weather_test)
) %>%
mae(truth=average_humidity, estimate=.pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 1.93
augment(bagfinalfit, new_data = weather_test) %>%
mae(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.970
augment(boostfinalfit, new_data = weather_test) %>%
mae(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.956
augment(lm_fit, new_data = weather_test) %>%
mae(truth = average_humidity, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 1.68
These are interesting patterns. When we use R-squared as our
yardstick, we see that – while the Decision Tree still performs the
worst and Linear Regression still performs second-worst – our Random
Forest and Boosted Trees models are equally accurate in predicting the
testing data. In addition, all of the models have an R-squared value of
over 0.95. Knowing that, it seems like splitting hairs to compare them,
but this is still useful information.
When we use MAE as our yardstick, we see the same pattern as the
RMSE. The models in order of best to worst performance are Boosted Trees
(0.956% MAE), Random Forest (0.97% MAE), Linear Regression (1.68% MAE),
and Decision Tree (1.93% MAE).
Finally, I can conclude that the Boosted Trees model is the best
model for predicting our weather data. Although the R-squared comparison
showed that the Random Forest and Boosted Trees models were the same,
The Mean Absolute Error and Root Mean Square Error comparisons both had
Boosted Trees performing the best. Upon researching the reason for why
this might be, I found that while Boosted Trees can be more prone to
overfitting than Random Forests, they still perform better on average.
Since we had 5 cross-validation folds with 5 repeats each, we can safely
conclude that our models are safeguarded against overfitting. A question
I would like to explore in the future would be whether I can predict all
of the weather parameters, or at least most of them, using date and time
of year as predictor variables. This project taught me a lot about
modeling data using the models that are most appropriate for the type of
data that I have. I also learned important lessons about gathering my
data from reputable and consistent sources. I would like to replicate
this project or work on a similar one in Python in the future to grow my
capabilities in data modeling and machine learning.